Journal of Computer Applications (《计算机应用》): sole official website


Double complex convolutional and attention aggregating recurrent network for speech enhancement

YU Ben-nian (余本年)1, ZHAN Yong-zhao (詹永照)1, MAO Qi-rong (毛启容)1,2, DONG Wen-long (董文龙)1, LIU Hong-lin (刘洪麟)1

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013
  2. Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang, Jiangsu 212013


  • Received: 2022-10-13 Revised: 2022-12-25 Accepted: 2022-12-28 Online: 2023-04-12 Published: 2023-04-12
  • Contact: ZHAN Yong-zhao
  • Supported by:
    Key Program of the National Natural Science Foundation of China; Jiangsu Province Key Research and Development Program

Abstract: To address the limited representation of correlation information among spectrogram features and the unsatisfactory denoising performance of existing speech enhancement methods, a Double Complex Convolutional and Attention Aggregating Recurrent Network (DCCARN) was proposed. First, a double complex convolutional network was established to encode the spectrogram features obtained by the Short-Time Fourier Transform (STFT) in two separate branches. Second, the encodings of the two branches were passed through inter-block and intra-block attention mechanisms, respectively, to re-weight different speech feature information. Third, long-term sequence information was processed by Long Short-Term Memory (LSTM), after which two decoders restored the spectrogram features and the restored features were aggregated. Finally, the target speech waveform was generated by the inverse STFT, thereby suppressing the noise. Experiments were carried out on the public Voice Bank + DEMAND (VBD) dataset and on a noise-added version of the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT). The results show that, compared with the phase-aware Deep Complex Convolutional Recurrent Network (DCCRN), DCCARN improved the Perceptual Evaluation of Speech Quality (PESQ) score by 5.597% and 2.672% on the two datasets, respectively. This verifies that the proposed method captures the correlation information among spectrogram features more accurately, suppresses noise more effectively, and improves speech intelligibility.

Key words: speech enhancement, attention mechanism, complex convolutional network, encoding, Long Short-Term Memory (LSTM)
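
The abstract names complex convolution as the building block of both encoder branches but does not say how it is computed. A common formulation in complex-valued networks builds one complex convolution from four real ones: (W_r + jW_i) * (X_r + jX_i) = (W_r * X_r − W_i * X_i) + j(W_r * X_i + W_i * X_r). The sketch below illustrates only that identity on a toy 1-D signal in plain Python. The function names and input values are hypothetical, and it should not be read as the authors' implementation, which operates on 2-D spectrogram features.

```python
def real_conv1d(x, w):
    """Valid-mode 1-D cross-correlation of two real sequences
    (the operation deep-learning 'convolution' layers compute)."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex convolution assembled from four real convolutions.

    Real part:      W_r * X_r - W_i * X_i
    Imaginary part: W_i * X_r + W_r * X_i
    """
    rr = real_conv1d(x_re, w_re)
    ii = real_conv1d(x_im, w_im)
    ri = real_conv1d(x_re, w_im)
    ir = real_conv1d(x_im, w_re)
    out_re = [a - b for a, b in zip(rr, ii)]
    out_im = [a + b for a, b in zip(ri, ir)]
    return out_re, out_im


def reference_complex_conv1d(x, w):
    """Same operation using Python's built-in complex type, for checking."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


# Hypothetical toy input standing in for one row of a complex spectrogram.
x = [1 + 2j, 0 - 1j, 3 + 0j, -2 + 1j]
w = [0.5 - 1j, 2 + 0.5j]

out_re, out_im = complex_conv1d(
    [v.real for v in x], [v.imag for v in x],
    [v.real for v in w], [v.imag for v in w],
)
ref = reference_complex_conv1d(x, w)

# The four-real-convolution form should agree with native complex arithmetic.
max_err = max(abs(complex(a, b) - r) for a, b, r in zip(out_re, out_im, ref))
print("max deviation from reference:", max_err)
```

Keeping the real and imaginary parts as separate real tensors is what lets such a layer be written with ordinary real-valued convolution primitives while still modelling magnitude and phase jointly.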
 
